
Rank-one matrix


M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

Neural Information Processing Systems

Efficiently approximating local curvature information of the loss function is a useful tool for the optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, limiting their practicality. In this work, we investigate matrix-free approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. The first algorithm we propose is tailored towards network compression and can compute the IHVP for dimension $d$ given a fixed set of $m$ rank-one matrices using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP and query cost $O(m)$ for computing any single element of the inverse Hessian approximation. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. We show that both algorithms yield competitive results for network pruning and optimization, respectively, with significantly lower computational overhead relative to existing second-order methods.
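The rank-one structure described in the abstract is what admits matrix-free IHVPs: with $H \approx \lambda I + \frac{1}{m}\sum_i g_i g_i^\top$, the inverse can be applied to a vector via repeated Sherman-Morrison updates without ever forming the $d \times d$ matrix. A minimal sketch of that idea follows; it is not the paper's optimized precomputation scheme, and the function name and `damp` parameter are illustrative assumptions.

```python
import numpy as np

def ihvp_sherman_morrison(grads, v, damp=1e-4):
    """Compute (damp*I + (1/m) * sum_i g_i g_i^T)^{-1} v by applying
    the Sherman-Morrison identity once per rank-one term.

    Keeps only the d x m matrix of products H_k^{-1} g_j, so the
    d x d Hessian approximation is never materialized."""
    m, d = grads.shape
    inv_g = grads.T / damp          # columns: H_0^{-1} g_j, with H_0 = damp*I
    out = v / damp                  # H_0^{-1} v
    for k in range(m):
        u = inv_g[:, k]             # u = H_k^{-1} g_k
        coef = 1.0 / (m + grads[k] @ u)
        out = out - coef * (grads[k] @ out) * u
        # update remaining columns so they hold H_{k+1}^{-1} g_j
        if k + 1 < m:
            inv_g[:, k + 1:] -= coef * np.outer(u, grads[k] @ inv_g[:, k + 1:])
    return out
```

The loop does $O(dm)$ work per rank-one term, i.e. $O(dm^2)$ total, matching the precomputation cost quoted in the abstract while using only $O(dm)$ memory.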



A Organization of the Appendix

Neural Information Processing Systems

We remind the reader of some standard facts about kernel ridge regression and the Gaussian/RBF kernel -- see (Shalev-Shwartz and Ben-David 2014) for a reference.
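For concreteness, here is a minimal sketch of the standard facts being invoked: kernel ridge regression with the Gaussian/RBF kernel has the closed-form solution $\alpha = (K + \lambda I)^{-1} y$, with predictions $f(x) = \sum_i \alpha_i k(x_i, x)$. Function names and hyperparameter values are illustrative assumptions, not taken from the cited text.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian/RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam=1e-4, gamma=1.0):
    """Closed-form kernel ridge solution: alpha = (K + lam*I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    """Predict f(x) = sum_i alpha_i k(x_i, x) at the test points."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```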


Nonconvex One-bit Single-label Multi-label Learning

Qiu, Shuang, Luo, Tingjin, Ye, Jieping, Lin, Ming

arXiv.org Machine Learning

An important topic in multi-label learning research is how to exploit the relationships between different classes of labels in order to improve learning accuracy or reduce the number of required labels. When labels are partially observed, the low-rank matrix model is one of the most popular models for dealing with missing labels. As human labeling is usually expensive and time-consuming, it is critical to design a robust algorithm that can learn the underlying low-rank matrix model on datasets with noisy, heavily missing labels. In this work, we consider an extreme scenario in which each training instance has only one single label annotated, in binary form, out of multiple classes of labels. This scenario is often encountered in real-world systems but is less discussed in the literature. For example, it is rare for a user to annotate a news article or a piece of music with many tags, especially when the user is not paid for the annotation. The problem becomes challenging when we have a large number of features and classes. Over the past decades, a number of multi-label learning approaches have been proposed under different settings.


A Non-convex One-Pass Framework for Generalized Factorization Machine and Rank-One Matrix Sensing

Lin, Ming, Ye, Jieping

Neural Information Processing Systems

We develop an efficient alternating framework for learning a generalized version of the Factorization Machine (gFM) on streaming data with provable guarantees. When the instances are sampled from $d$-dimensional random Gaussian vectors and the target second-order coefficient matrix in gFM is of rank $k$, our algorithm converges linearly, achieves $O(\epsilon)$ recovery error after retrieving $O(k^{3}d\log(1/\epsilon))$ training instances, consumes $O(kd)$ memory in one pass over the dataset, and requires only matrix-vector product operations in each iteration. The key ingredient of our framework is the construction of an estimation sequence endowed with a so-called Conditionally Independent RIP condition (CI-RIP). As special cases of gFM, our framework can be applied to symmetric or asymmetric rank-one matrix sensing problems, such as inductive matrix completion and phase retrieval.
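To make the sensing model concrete: in symmetric rank-one matrix sensing one observes $y_i = x_i^\top M x_i$ with $M = ww^\top$. The following is a hedged sketch of a standard spectral estimate, not the paper's CI-RIP estimation sequence; it uses the fact that for standard Gaussian $x$, $\mathbb{E}[y\,xx^\top] = 2ww^\top + \|w\|^2 I$, so the top eigenvector of the empirical average aligns with $w$.

```python
import numpy as np

def spectral_direction(X, y):
    """Top eigenvector of (1/n) * sum_i y_i x_i x_i^T."""
    M = (X * y[:, None]).T @ X / len(y)
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return vecs[:, -1]               # eigenvector of the largest eigenvalue

# demo: recover the direction of w from rank-one measurements
rng = np.random.default_rng(1)
d, n = 5, 20000
w = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = (X @ w) ** 2                     # y_i = x_i^T (w w^T) x_i
v = spectral_direction(X, y)
cos = abs(v @ w) / np.linalg.norm(w)
```

With enough samples, `cos` approaches 1, i.e. the estimate aligns with $w$ up to sign; recovering the sign and magnitude requires a further step, which is where refinement schemes such as alternating updates come in.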

